Abstract

Deep convolutional neural networks (CNNs) have had a major impact in
most areas of image understanding, including object category detection.
In object detection, methods such as R-CNN have obtained excellent
results by integrating CNNs with region proposal generation algorithms
such as selective search. In this paper, we investigate the role of
proposal generation in CNN-based detectors in order to determine whether
it is a necessary modelling component, carrying essential geometric
information not contained in the CNN, or whether it is merely a way of
accelerating detection. We do so by designing and evaluating a detector
that uses a trivial region generation scheme, constant for each image.
Combined with SPP, this results in an excellent and fast detector that
does not need to process the image with any algorithm other than the CNN
itself. We also streamline and simplify the training of CNN-based
detectors by integrating several learning steps into a single algorithm,
and we propose a number of improvements that accelerate detection.

1 Introduction 

Object detection is one of the core problems in image understanding.
Until recently, the best performing detectors in standard benchmarks
such as PASCAL VOC were based on a combination of handcrafted image
representations such as SIFT, HOG, and the Fisher Vector and a form of
structured output regression, from sliding window to deformable parts
models. Recently, however, these pipelines have been outperformed
significantly by the ones based on deep learning that acquire
representations automatically from data using Convolutional Neural
Networks (CNNs). Currently, the best CNN-based detectors are based on
the R-CNN construction of [9]. Conceptually, R-CNN is remarkably simple:
it samples image regions using a proposal mechanism such as Selective
Search (SS) and classifies them as foreground and background using a
CNN. Looking more closely, however, R-CNN leaves open several
interesting questions.

Figure 1: Some examples of the bounding box regressor outputs. The
dashed box is the image-agnostic proposal, correctly selected despite
the bad overlap, and the solid box is the result of improving it by
using the pose regressor. Both steps use the same CNN, but the first
uses the geometrically-invariant fully-connected layers, and the second
the geometry-sensitive convolutional layers. In this manner, accurate
object location can be recovered without using complementary mechanisms
such as selective search.

The first question is whether CNNs contain sufficient geometric
information to localise objects, or whether this information must be
supplemented by an external mechanism, such as region proposal
generation. There are in fact two hypotheses. The first one is that the
only role of proposal generation is to cut down computation by allowing
the CNN, which is expensive, to be evaluated on a small number of image
regions. In this case proposal generation becomes less important as
other speedups such as SPP-CNN become available, and may be forgone.
The second hypothesis, instead, is that proposal generation provides
geometric information vital for accurate object localisation which is
not represented in the CNN. This is not unlikely, given that CNNs are
often trained to be highly invariant to even large geometric
deformations and hence may not be sensitive to an object’s location.
This question is answered in Section 3.1 by showing that the
convolutional layers of standard CNNs contain sufficient information to
localise objects (Figure 1).

The second question is whether the R-CNN pipeline can be simplified.
While conceptually straightforward, R-CNN in fact comprises many
practical steps that need to be carefully implemented and tuned to
obtain good performance. To start with, R-CNN builds on a CNN
pre-trained on an image classification task such as ImageNet
ILSVRC [5]. This CNN is ported to detection by: i) learning an SVM
classifier for each object class on top of the last fully-connected
layer of the network, ii) fine-tuning the CNN on the task of
discriminating objects and background, and iii) learning a bounding box
regressor for each object class. Section 3.2 simplifies these steps,
which require running a mix of different software on cached data, by
training a single CNN addressing all required tasks.

The third question is whether R-CNN can be accelerated. A substantial
speedup was already obtained with spatial pyramid pooling (SPP) by
realising that convolutional features can be shared among different
regions rather than being recomputed. However, this does not accelerate
training, and in testing the region proposal generation mechanism
becomes the new bottleneck. The combination of dropping proposal
generation and of the other simplifications is shown in Section 4 to
provide a substantial detection speedup, and this for the overall
system, not just the CNN part. Our findings are summarised in
Section 5.

Related work. The basis of our work is the current generation of deep
CNNs for image understanding, pioneered by Krizhevsky et al. For object
detection, our method builds directly on the R-CNN approach of [9] as
well as the SPP extension proposed by He et al.
All such methods rely not only on CNNs, but also on a region proposal
generation mechanism such as SS, CPMC [3], multi-scale combinatorial
grouping, and edge boxes [23]. These methods, which are extensively
reviewed in [13], originate in the idea of “objectness” proposed
by [1]. Interestingly, [13] showed that a good region proposal scheme
is essential for R-CNN to work well. Here, we show that this is in fact
not the case provided that bounding box locations are corrected by a
strong CNN-based bounding box regressor, a step that was not evaluated
for R-CNNs in [13]. The R-CNN and SPP-CNN detectors build on years of
research in object detection. Both can be seen as accelerated sliding
window detectors. The two-stage computation using region proposal
generation is a form of cascade detector [20] or jumping window.
However, they differ from part-based detectors such as [7] in that they
do not explicitly model object parts in learning; instead parts are
implicitly captured in the CNN. Integrated training of SPP-CNN as a
single CNN learning problem, not dissimilar to some of the ideas of
Section 3.2, has very recently been explored in the unpublished
manuscript [3].

2 CNN-based detectors 

This section introduces the R-CNN (Section 2.1) and SPP-CNN
(Section 2.2) detectors.

2.1 R-CNN detector 

The R-CNN method [9] is a chain of conceptually simple steps:
generating candidate object regions, classifying them as foreground or
background, and post-processing them to improve their fit to objects.
These steps are described next.

Region proposal generation. R-CNN starts by running an algorithm such
as SS or CPMC [3] to extract from an image $\mathbf{x}$ a shortlist of
image regions $R \in \mathcal{R}(\mathbf{x})$ that are likely to contain
objects. These proposals, of the order of a few thousand per image, may
have arbitrary shapes, but in the following are assumed to be converted
to rectangles.

CNN-based features. Candidate regions are described by CNN features
before being classified. The CNN itself is transferred from a different
problem, usually image classification in the ImageNet ILSVRC
challenge [5]. In this manner, the CNN can be trained on a very large
dataset, as required to obtain good performance, and then applied to
object detection, where datasets are usually much smaller. In order to
transfer a pre-trained CNN to object detection, its last few layers,
which are specific to the classification task, are removed; this results
in a “beheaded” CNN $\phi$ that outputs relatively generic features.
The CNN is applied to the image regions $R$ by cropping and resizing the
image $\mathbf{x}$, i.e. $\phi_{\text{rcnn}}(\mathbf{x}; R) =
\phi(\text{resize}(\mathbf{x}|_R))$. Cropping and resizing serve two
purposes: to localise the descriptor and to provide the CNN with an
image of a fixed size, as this is required by many CNN architectures.

SVM training. Given the region descriptor
$\phi_{\text{rcnn}}(\mathbf{x}; R)$, the next step is to learn an SVM
classifier to decide whether a region contains an object or background.
Learning the SVM starts from a number of example images
$\mathbf{x}_1, \dots, \mathbf{x}_N$, each annotated with ground-truth
regions $R \in \mathcal{R}_{\text{gt}}(\mathbf{x}_i)$ and object labels
$c(R) \in \{1, \dots, C\}$. In order to learn a classifier for class
$c$, R-CNN divides ground-truth $\mathcal{R}_{\text{gt}}(\mathbf{x}_i)$
and candidate $\mathcal{R}(\mathbf{x}_i)$ regions into positive and
negative. In particular, ground-truth regions
$R \in \mathcal{R}_{\text{gt}}(\mathbf{x})$ of class $c(R) = c$ are
assigned a positive label $y(R; c; \tau) = +1$; other regions $R$ are
labelled as ambiguous, $y(R; c; \tau) = \epsilon$, and ignored if
$\text{overlap}(R, R') \geq \tau$ with any ground-truth region
$R' \in \mathcal{R}_{\text{gt}}(\mathbf{x})$ of the same class
$c(R') = c$. The remaining regions are labelled as negative. Here
$\text{overlap}(A, B) = |A \cap B| / |A \cup B|$ is the
intersection-over-union overlap measure, and the threshold is set to
$\tau = 0.3$. The SVM takes the form
$\phi_{\text{SVM}} \circ \phi_{\text{rcnn}}(\mathbf{x}; R)$, where
$\phi_{\text{SVM}}$ is a linear predictor
$\langle \mathbf{w}_c, \phi_{\text{rcnn}} \rangle + b_c$ learned using
an SVM solver to minimise the regularised empirical hinge loss risk.
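The labelling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: boxes are taken to be axis-aligned rectangles given by their corners (x1, y1, x2, y2), the threshold comparison is taken to be inclusive, and the function names `overlap` and `svm_label` are hypothetical rather than part of any R-CNN code base.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # width of the intersection
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # height of the intersection
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def svm_label(region, is_ground_truth, gt_boxes_of_class, tau=0.3):
    """Label one region for the classifier of a single class c.

    Returns +1 for a ground-truth region of class c, 0 for an ambiguous
    region (ignored during training), and -1 for a negative region.
    The inclusive comparison with tau is an assumption of this sketch.
    """
    if is_ground_truth:
        return +1
    if any(overlap(region, g) >= tau for g in gt_boxes_of_class):
        return 0
    return -1
```

With tau = 0.3, a candidate that covers most of a ground-truth box of class c is ignored rather than used as a negative, so the SVM is only trained on clear-cut examples.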

Bounding box regression. Candidate bounding boxes are refitted to
detected objects by using a CNN-based regressor as in [9]. Given a
candidate bounding box $R = (x, y, w, h)$, where $(x, y)$ is its centre
and $(w, h)$ its width and height, a linear regressor estimates an
adjustment $\mathbf{d} = (d_x, d_y, d_w, d_h)$ that yields the new
bounding box $\mathbf{d}[R] = (w d_x + x, h d_y + y, w e^{d_w},
h e^{d_h})$. In order to train this regressor, one collects for each
ground-truth region $R^*$ all the candidates $R$ that overlap
sufficiently with it (with an overlap of at least 0.5). Each pair
$(R^*, R)$ of regions is converted into a training input/output pair
$(\phi_{\text{cnv}}(\mathbf{x}, R), \mathbf{d})$ for the regressor,
where $\mathbf{d}$ is the adjustment required to transform $R$ into
$R^*$, i.e. $R^* = \mathbf{d}[R]$. The pairs are then used to train the
regressor using ridge regression with a large regularisation constant.
The regressor itself takes the form $\mathbf{d} = Q_c^\top
\phi_{\text{cnv}}(\text{resize}(\mathbf{x}|_R)) + \mathbf{t}_c$, where
$\phi_{\text{cnv}}$ denotes the CNN restricted to the convolutional
layers, as further discussed in Section 2.2. The regressor is further
improved by retraining it after removing the 20% of the examples with
the worst regression loss, as found in the publicly-available
implementation of SPP-CNN.
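The box parametrisation used by the regressor can be made concrete with a short Python sketch. It only covers the geometry (applying an adjustment and computing the regression target for a candidate/ground-truth pair), not the ridge regression itself; the function names are hypothetical and boxes are (x, y, w, h) tuples with (x, y) the centre, as in the text.

```python
import math


def apply_adjustment(box, d):
    """Return d[R] = (w*dx + x, h*dy + y, w*exp(dw), h*exp(dh))."""
    x, y, w, h = box
    dx, dy, dw, dh = d
    return (w * dx + x, h * dy + y, w * math.exp(dw), h * math.exp(dh))


def regression_target(box, gt_box):
    """Return the adjustment d such that apply_adjustment(box, d) == gt_box."""
    x, y, w, h = box
    gx, gy, gw, gh = gt_box
    return ((gx - x) / w, (gy - y) / h, math.log(gw / w), math.log(gh / h))
```

For instance, a candidate (50, 40, 20, 10) paired with a ground-truth box (55, 42, 30, 8) has target d = (0.25, 0.2, log 1.5, log 0.8): centre offsets are measured in units of the candidate size and size changes on a logarithmic scale.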

Post-processing. The refined bounding boxes are passed to non-maxima
suppression before being evaluated. Non-maxima suppression eliminates
duplicate detections, prioritising regions with a higher SVM score
$s(\phi(R))$. Starting from the highest ranked region in an image, other
regions are iteratively removed if they overlap by more than 0.3 with
any region retained so far.
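A minimal sketch of this greedy non-maxima suppression is given below, assuming corner-format boxes (x1, y1, x2, y2) and intersection-over-union as the overlap measure; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maxima_suppression(boxes, scores, max_overlap=0.3):
    """Return the indices of the retained boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box kept so far.
        if all(iou(boxes[i], boxes[j]) <= max_overlap for j in kept):
            kept.append(i)
    return kept
```

The procedure is quadratic in the number of boxes in the worst case, which is acceptable for the few thousand candidates per image considered here.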

CNN fine-tuning. The quality of the CNN features, ported from an image
classification task, can be improved by fine-tuning the network on the
target data. In order to do so, the CNN
$\phi_{\text{rcnn}}(\mathbf{x}; R)$ is concatenated with additional
layers $\phi_{\text{sftmx}}$ (linear followed by softmax normalisation)
to obtain a predictor for the $C + 1$ object classes. The new CNN
$\phi_{\text{sftmx}} \circ \phi_{\text{rcnn}}(\mathbf{x}; R)$ is then
trained as a classifier by minimising its empirical logistic risk on a
training set of labelled regions. This is analogous to the procedure
used to learn the CNN in the first place, but with a reduced learning
rate and a different (and smaller) training set similar to the one used
to train the SVM. In this dataset, a region $R$, either ground-truth or
candidate, is assigned the class $c(R; \tau_+, \tau_-) = c(R^*)$ of the
closest ground-truth region $R^* = \operatorname{argmax}_{R' \in
\mathcal{R}_{\text{gt}}(\mathbf{x})} \text{overlap}(R, R')$, provided
that $\text{overlap}(R, R^*) \geq \tau_+$. If instead
$\text{overlap}(R, R^*) < \tau_-$, then the region is labelled as
background, $c(R; \tau_+, \tau_-) = 0$, and the remaining regions as
ambiguous. By default $\tau_+$ and $\tau_-$ are both set to 1/2,
resulting in a much more relaxed training set than for the SVM. Since
the dataset is strongly biased towards background regions, during CNN
training it is rebalanced by sampling with 25% probability regions such
that $c(R) > 0$ and with 75% probability regions such that $c(R) = 0$.
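The fine-tuning labels and the rebalanced sampling can be sketched as follows. The sketch assumes corner-format boxes (x1, y1, x2, y2), treats the comparison with the positive threshold as inclusive, and labels a region as background when the image has no ground-truth boxes; these choices and all function names are assumptions made for illustration.

```python
import random


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def finetune_label(region, gt_boxes, gt_classes, tau_pos=0.5, tau_neg=0.5):
    """Return a class in 1..C, 0 for background, or None if ambiguous."""
    if not gt_boxes:
        return 0
    overlaps = [iou(region, g) for g in gt_boxes]
    best = max(range(len(gt_boxes)), key=lambda k: overlaps[k])  # closest GT box
    if overlaps[best] >= tau_pos:
        return gt_classes[best]
    if overlaps[best] < tau_neg:
        return 0
    return None


def sample_region(foreground, background, rng, p_foreground=0.25):
    """Draw one training region: foreground with probability 25%, else background."""
    pool = foreground if rng.random() < p_foreground else background
    return rng.choice(pool)
```

With both thresholds at 1/2 no region is ambiguous; the ambiguous band only appears when the negative threshold is set below the positive one.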

2.2 SPP-CNN detector 

A significant disadvantage of R-CNN is the need to recompute the whole
CNN from scratch for each evaluated region; since this occurs thousands
of times per image, the method is slow. SPP-CNN addresses this issue by
factoring the CNN $\phi = \phi_{\text{fc}} \circ \phi_{\text{cnv}}$ into
two parts, where $\phi_{\text{cnv}}$ contains the so-called
convolutional layers, pooling information from local regions, and
$\phi_{\text{fc}}$ the fully connected (FC) ones, pooling information
from the image as a whole. Since the convolutional layers encode local
information, this can be selectively pooled to encode the appearance of
an image subregion $R$ instead of the whole image. In more detail, let
$\mathbf{y} = \phi_{\text{cnv}}(\mathbf{x})$ be the output of the
convolutional layers applied to image $\mathbf{x}$. The feature field
$\mathbf{y}$ is a $H \times W \times D$ tensor of height $H$ and width
$W$, proportional to the height and width of the input image
$\mathbf{x}$, and $D$ feature channels. Let
$\mathbf{z} = \text{SP}(\mathbf{y}; R)$ be the result of applying the
spatial pooling (SP) operator to the features in $\mathbf{y}$ contained
in region $R$. This operator is defined as:

$$z_d = \max_{(i,j)\,:\,g(i,j) \in R} y_{ijd}, \qquad d = 1, \dots, D \tag{1}$$

where the function $g$ maps the feature coordinates $(i,j)$ back to
image coordinates $g(i,j)$. The SP operator is extended to spatial
pyramid pooling (SPP) by dividing the region $R$ into subregions
$R = R_1 \cup R_2 \cup \dots \cup R_K$, applying the SP operator to
each, and then stacking the resulting features. In practice, SPP-CNN
uses $K \times K$ subdivisions, where $K$ is chosen to match the size of
the convolutional feature field in the original CNN. In this manner, the
output can be concatenated with the existing FC layers:
$\phi_{\text{spp}}(\mathbf{x}; R) = \phi_{\text{fc}} \circ
\text{SPP}(\cdot; R) \circ \phi_{\text{cnv}}(\mathbf{x})$. Note that,
compared to R-CNN, the first part of the computation is shared among all
regions $R$.

Next, we derive the map $g$ that transforms feature coordinates back to
image coordinates as required by (1) (this correspondence was
established only heuristically in SPP-CNN). It suffices to consider one
spatial dimension. The question is which pixel $x_0(i_0)$ corresponds to
feature $x_L(i_L)$ in the $L$-th layer of a CNN. While there is no
unique definition, a useful one is to let $i_0$ be the centre of the
receptive field of feature $x_L(i_L)$, defined as the set of pixels that
can affect $x_L(i_L)$ as a function of the image (i.e. the support of
the feature). A short calculation leads to

$$i_0 = g_L(i_L) = \alpha_L (i_L - 1) + \beta_L,$$

$$\alpha_L = \prod_{p=1}^{L} S_p,$$

$$\beta_L = 1 + \sum_{p=1}^{L} \left( \prod_{q=1}^{p-1} S_q \right) \left( \frac{F_p - 1}{2} - P_p \right),$$

where $S_p$, $F_p$ and $P_p$ denote the stride, filter size and padding
of the $p$-th layer.

3 Simplifying and streamlining R-CNN

This section describes the main technical contributions of the paper:
removing region proposal generation from R-CNN (Section 3.1) and
streamlining the pipeline (Section 3.2).

3.1 Dropping region proposal generation

While the SPP method (Section 2.2) accelerates R-CNN evaluation by
orders of magnitude, it does not result in a comparable acceleration of
the detector as a whole; in fact, proposal generation with SS is about
ten times slower than SPP classification. Much faster proposal
generators exist, but may not result in very accurate regions [22].
Here we propose to drop $\mathcal{R}(\mathbf{x})$ entirely and to use
instead an image-independent list of candidate regions $\mathcal{R}_0$,
using the CNN itself to regress better object locations a-posteriori.

Figure 2: Bounding box distributions using the normalised coordinates of
Section 3.1. Rows show the histograms for the bounding box centre
$(x, y)$, size $(w, h)$, and scale vs distance from centre $(s, |c|)$.
Columns show the statistics for ground-truth, selective search,
restricted selective search, sliding window, and cluster bounding boxes.

Constructing $\mathcal{R}_0$ starts by studying the distribution of
bounding boxes in a representative object detection benchmark, namely
the PASCAL VOC 2007 data [6]. A box is defined by the tuple
$(r_s, c_s, r_e, c_e)$ denoting the coordinates $(r_s, c_s)$ and
$(r_e, c_e)$ of its upper-left and lower-right corners. Given an image
of size $H \times W$, define the normalised width and height as
$w = (c_e - c_s)/W$ and $h = (r_e - r_s)/H$ respectively; define also
the scale $s = \sqrt{wh}$ and the distance from the image centre
$|c| = \| [(c_s + c_e)/2W - 0.5, (r_s + r_e)/2H - 0.5] \|_2$.
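As a small worked example of these normalised statistics, the sketch below computes the width, height, scale and distance from the image centre for one box. The function name `box_statistics` and the example numbers are illustrative only.

```python
import math


def box_statistics(box, height, width):
    """Return (w, h, s, |c|) for a box (r_s, c_s, r_e, c_e) in an H x W image."""
    r_s, c_s, r_e, c_e = box
    w = (c_e - c_s) / width            # normalised width
    h = (r_e - r_s) / height           # normalised height
    s = math.sqrt(w * h)               # scale
    cx = (c_s + c_e) / (2 * width) - 0.5
    cy = (r_s + r_e) / (2 * height) - 0.5
    return w, h, s, math.hypot(cx, cy)  # |c| is the Euclidean norm of (cx, cy)
```

For a 400 x 500 image, the box (100, 50, 300, 450) is perfectly centred, so |c| = 0, and it has w = 0.8, h = 0.5 and s ≈ 0.63, which is the kind of large, centred box that dominates the ground-truth statistics.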

The first column of Figure 2 shows the distribution of such parameters
for the GT boxes in the PASCAL data. It is evident that boxes tend to
appear close to the image centre and to fill the image. The statistics
of SS regions differ substantially; in particular, the $(s, |c|)$
histogram shows that SS boxes tend to be distributed much more uniformly
in scale and space than the GT ones. If SS boxes are restricted to the
ones that have an overlap of at least 0.5 with a GT bounding box, then
the distributions are similar again, with a strong preference for
centred and large boxes.

The fourth column shows the distribution of boxes generated by a sliding
window (SW) object detector. For an “exhaustive” enumeration of boxes
at all locations, scales, and aspect ratios, there can be hundreds of
thousands of boxes per image. Here we subsample this set to 7K in order
to obtain a candidate set with a size comparable to SS. This was
obtained by sampling the width of the bounding boxes as $w = w_0 2^l$,
$l = 0, 0.5, \dots, 4$, where $w_0 \approx 40$ pixels is the width of
the smallest bounding box considered in the SPP-CNN detector.
Similarly, aspect ratios are sampled as $2^{\{-1, -0.75, \dots, 1\}}$.
The distribution of boxes, visualised in the fourth column of Figure 2,
is similar to SS and dissimilar from GT.
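The sampling of sliding window shapes can be illustrated with the sketch below. It enumerates widths 40 · 2^l for l = 0, 0.5, ..., 4 and aspect ratios 2^a for a = -1, -0.75, ..., 1. Two points are assumptions of the sketch rather than statements of the text: the aspect ratio is interpreted as height divided by width, and the exponent step of 0.25 is inferred from the first two values of the sequence. The function name is hypothetical.

```python
def sliding_window_shapes(w0=40.0):
    """Enumerate (width, height) pairs for the sliding window candidates."""
    width_exponents = [0.5 * k for k in range(9)]         # l = 0, 0.5, ..., 4
    ratio_exponents = [-1.0 + 0.25 * k for k in range(9)]  # a = -1, -0.75, ..., 1
    shapes = []
    for l in width_exponents:
        w = w0 * 2.0 ** l
        for a in ratio_exponents:
            # Assumption: the aspect ratio 2**a is height / width.
            shapes.append((w, w * 2.0 ** a))
    return shapes
```

This yields 81 distinct shapes, from 40 x 20 up to 640 x 1280 pixels; each shape would then be placed at multiple locations and the resulting set subsampled to the desired number of candidates.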

A simple modification of sliding window is to bias sampling to match the
statistics of the GT bounding boxes. We do so by computing $n$ K-means
clusters from the collection of vectors $(r_s, c_s, r_e, c_e)$ obtained
from the GT boxes in the PASCAL VOC training data. We call this set of
boxes $\mathcal{R}_0(n)$; the fifth column of Figure 2 shows that, as
expected, the corresponding distribution matches nicely the one of GT.
Section 4 shows empirically that, when combined with a CNN-based
bounding box regressor, this proposal set results in a very competitive
(and very fast) detector.
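A minimal sketch of how such an image-independent proposal set could be built is given below: plain K-means (Lloyd's algorithm) on the 4-dimensional box vectors, using only the Python standard library. The initialisation, the fixed number of iterations and the function name `kmeans_boxes` are assumptions made for illustration; the text does not specify these details.

```python
import random


def kmeans_boxes(boxes, n, iterations=20, seed=0):
    """Cluster box vectors (r_s, c_s, r_e, c_e) into n centres with Lloyd's algorithm."""
    rng = random.Random(seed)
    centres = [list(b) for b in rng.sample(boxes, n)]  # initialise from the data
    for _ in range(iterations):
        sums = [[0.0] * 4 for _ in range(n)]
        counts = [0] * n
        for b in boxes:
            # Assign the box to the nearest centre (squared Euclidean distance).
            k = min(range(n),
                    key=lambda j: sum((b[t] - centres[j][t]) ** 2 for t in range(4)))
            counts[k] += 1
            for t in range(4):
                sums[k][t] += b[t]
        for j in range(n):
            if counts[j]:  # leave empty clusters where they are
                centres[j] = [v / counts[j] for v in sums[j]]
    return [tuple(c) for c in centres]
```

The returned centres are themselves boxes and can be used directly as the fixed candidate list for every image, for instance with n = 3000.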

3.2 Streamlined detection pipeline 

This section proposes several simplifications to the R/SPP-CNN
pipelines complementary to dropping region proposal generation as done
in Section 3.1. As a result of all these changes, the whole detector,
including detection of multiple object classes and bounding box
regression, reduces to evaluating a single CNN. Furthermore, the
pipeline is straightforward to implement on GPU, and is sufficiently
memory-efficient to process multiple images at once. In practice, this
results in an extremely fast detector which still retains excellent
performance.

Dropping the SVM. As discussed in Section 2.1, R-CNN involves training
an SVM classifier for each target object class as well as fine-tuning
the CNN features for all classes. An obvious question is whether SVM
training is redundant and can be eliminated.

Recall from Section 2.1 that fine-tuning learns a softmax predictor
$\phi_{\text{sftmx}}$ on top of R-CNN features
$\phi_{\text{rcnn}}(\mathbf{x}; R)$, whereas SVM training learns a
linear predictor $\phi_{\text{SVM}}$ on top of the same features. In the
first case, $P_c = P(c | \mathbf{x}, R) = [\phi_{\text{sftmx}} \circ
\phi_{\text{rcnn}}(\mathbf{x}; R)]_c$ is an estimate of the class
posterior for region $R$; in the second case $S_c = [\phi_{\text{SVM}}
\circ \phi_{\text{rcnn}}(\mathbf{x}; R)]_c$ is a score that
discriminates class $c$ from any other class (in both cases background
is treated as one of the classes). As verified in Section 4 and
Table 1, $P_c$ works poorly as a score for an object detector; however,
and somewhat surprisingly, using as score the ratio $S'_c = P_c / P_0$
results in performance nearly as good as using an SVM. Further, note
that $\phi_{\text{sftmx}}$ can be decomposed as $C + 1$ linear
predictors $\langle \mathbf{w}_c, \phi_{\text{rcnn}} \rangle + b_c$
followed by exponentiation and normalisation; hence the score $S'_c$
reduces to the expression $S'_c = \exp(\langle \mathbf{w}_c -
\mathbf{w}_0, \phi_{\text{rcnn}} \rangle + b_c - b_0)$.
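The identity between the ratio of softmax posteriors and the exponential of a difference of linear scores can be checked numerically. The sketch below assumes that the linear scores are listed with the background class first (index 0); the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of linear scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modified_softmax_scores(linear_scores):
    """Return S'_c = P_c / P_0 for c = 1..C without computing the softmax.

    linear_scores[c] is the linear score of class c, with index 0 the
    background class. The softmax normalisation cancels in the ratio,
    leaving exp(score_c - score_0).
    """
    background = linear_scores[0]
    return [math.exp(s - background) for s in linear_scores[1:]]
```

For linear scores (2.0, 1.0, 3.5), the modified scores are exp(-1) and exp(1.5), which coincide with the ratios P_1/P_0 and P_2/P_0 of the softmax outputs. Unlike P_c, the ratio is not capped at 1 and does not depend on the scores of the other object classes.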

Integrating SPP and bounding box regression. While in the original
implementation of SPP the pooling mechanism is external to the CNN
software, we implement it directly as a layer
$\text{SPP}(\cdot; R_1, \dots, R_n)$. This layer takes as input a tensor
representing the convolutional features
$\phi_{\text{cnv}}(\mathbf{x}) \in \mathbb{R}^{H \times W \times D}$ and
outputs $n$ feature fields of size $h \times w \times D$, one for each
region $R_1, \dots, R_n$ passed as input. These fields can be stacked in
a 4D output tensor, which is supported by all common CNN software. Given
a dual CPU/GPU implementation of the layer, SPP integrates seamlessly
with most CNN packages, with substantial benefit in speed and
flexibility, including the possibility of training with
back-propagation through it.

                              Single scale      Multi scale
Evaluation method             BB regression     BB regression
                              no      yes       no      yes
S_c (SVM)                     54.0    58.6      56.3    59.7
P_c (softmax)                 27.9    34.5      30.1    38.1
P_c/P_0 (modified softmax)    54.0    58.0      55.3    58.4

Table 1: Evaluation of SPP-CNN with and without the SVM classifier. The
table reports mAP on the PASCAL VOC 2007 test set for the single scale
and multi scale detector, with or without bounding box regression.
Different rows compare the different bounding box scoring mechanisms of
Section 3.2: the SVM scores $S_c$, the softmax posterior probability
scores $P_c$, and the modified softmax scores $P_c / P_0$.

Similar to SPP, bounding box regression is easily integrated as a bank
of filters $(Q_c, \mathbf{b}_c)$, $c = 1, \dots, C$, running on top of
the convolutional features $\phi_{\text{cnv}}(\mathbf{x})$. This is
cheap enough that it can be done in parallel for all the object classes.

Scale-augmented training, single scale evaluation. While SPP is fast,
one of the most time-consuming steps is to evaluate features at multiple
scales. However, the authors of SPP-CNN also indicate that restricting
evaluation to a single scale has a marginal effect on performance. Here,
we maintain the idea of evaluating the detector at test time by
processing each image at a single scale. However, this requires the CNN
to explicitly learn scale invariance, which is achieved by fine-tuning
the CNN using randomly rescaled versions of the training data.

4 Experiments 

This section evaluates the changes to R-CNN and SPP-CNN proposed in
Section 3. All experiments use the Zeiler and Fergus (ZF) small
CNN [21] as this is the same network used by the work that introduced
SPP-CNN. While more recent networks such as the very deep models of
Simonyan and Zisserman [16] are likely to perform better, this choice
allows a direct comparison with that work. The detector itself is
trained and evaluated on the PASCAL VOC 2007 data [6], as this is a
default benchmark for object detection and is used there as well.

method       mAP    aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbike  person plant  sheep  sofa   train  tv
SVM MS       59.68  66.8   75.8   55.5   43.1   38.1   66.6   73.8   70.9   29.2   71.4   58.6   65.5   76.2   73.6   57.4   29.9   60.1   48.4   66.0   66.8
SVM SS       58.60  66.1   76.0   54.9   38.6   32.4   66.3   72.8   69.3   30.2   67.7   63.7   66.2   72.5   71.2   56.4   27.3   59.5   50.4   65.3   65.2
FC8 MS       58.38  69.2   75.2   53.7   40.0   33.0   67.2   71.3   71.6   26.9   69.6   60.3   64.5   74.0   73.4   55.6   25.3   60.4   47.0   64.9   64.4
FC8 SS       57.99  67.0   75.0   53.3   37.7   28.3   69.2   71.1   69.7   29.7   69.1   62.9   64.0   72.7   71.0   56.1   25.6   57.7   50.7   66.5   62.3
FC8 C3k MS   53.41  55.8   73.1   47.5   36.5   17.8   69.1   55.2   73.1   24.4   49.3   63.9   67.8   76.8   71.1   48.7   27.6   42.6   43.4   70.1   54.5
FC8 C3k SS   53.52  55.8   73.3   47.3   37.3   17.6   69.3   55.3   73.2   24.0   49.0   63.3   68.2   76.5   71.3   48.2   27.1   43.8   45.1   70.2   54.6

Table 2: Comparison of different variants of the SPP-CNN detector. First
group of rows: original SPP-CNN using Multi Scale (MS) or Single Scale
(SS) detection. Second group: the same experiment, but dropping the SVM
and using the modified softmax scores of Section 3.2. Third group:
SPP-CNN without region proposal generation, but using a fixed set of 3K
candidate bounding boxes as explained in Section 3.1.

Dropping the SVM. The first experiment evaluates the performance of the
SPP-CNN detector with or without the linear SVM classifier, comparing
the bounding box scores $S_c$ (SVM), $P_c$ (softmax), and $S'_c$
(modified softmax) of Section 3.2. As can be seen in Table 1 and
Table 2, the best performing method is SPP-CNN evaluated at multiple
scales, resulting in 59.7% mAP on the PASCAL VOC 2007 test data (this
number matches the one reported for the original SPP-CNN, validating
our implementation). Removing the SVM and using the CNN softmax scores
directly performs very poorly, with a drop of 21.6 mAP points. However,
adjusting the softmax scores using the simple formula $P_c / P_0$
restores the performance almost entirely, back to 58.4% mAP. While
there is still a small drop of 1.3 mAP points compared to using the
SVM, removing the latter dramatically simplifies the detector pipeline,
resulting in particular in significantly faster training as it removes
the need to prepare and cache data for the SVM (as well as to learn
it).

Multi-scale evaluation. The second set of experiments assesses the
importance of performing multi-scale evaluation of the detector. Results
are reported once more in Tables 1 and 2. Once more, multi-scale
detection is the best performing method, with performance up to 59.7%
mAP. However, single scale testing is very close to this level of
performance, at 58.6%, with a drop of just 1.1 mAP points. Just like
when removing the SVM, the resulting simplification and in this case
detection speedup make this drop in accuracy more than tolerable. In
particular, testing at a single scale accelerates detection roughly
fivefold.

Figure 3: mAP on the PASCAL VOC 2007 test data as a function of the
number of candidate boxes per image, the proposal generation method, and
whether or not bounding box regression is used.


Dropping region proposal generation. The next experiment evaluates
replacing the SS region proposals $\mathcal{R}_{\text{SS}}(\mathbf{x})$
with the fixed proposals $\mathcal{R}_0(n)$ as suggested in
Section 3.1. Table 2 shows the detection performance for $n = 3{,}000$,
a number of candidates comparable with the 2,000 extracted by selective
search. While there is a drop in performance compared to using SS, this
is small (59.68% vs 53.41%, i.e. a reduction of about 6 mAP points),
which is surprising since bounding box proposals are now oblivious of
the image content.

Figure 3 looks at these results in greater detail. Three bounding box
generation methods are compared: selective search, sliding windows, and
clustering (see also Section 3.1), with or without bounding box
regression. Neither clustering nor sliding windows result in an accurate
detector: even if the number of candidate boxes is increased
substantially (up to n = 7K), performance saturates at around 46% mAP.
This is much poorer than the ~56% achieved by selective search.
Bounding box regression improves selective search by about 3% mAP, up
to ~59%, but it has a much more significant effect on the other two
methods, improving performance by about 10% mAP. Note that clustering
with 3K candidates performs as well as sliding window with 7K.

SelS: 1.98 · 10^3

Impl. [ms]   Prep.   Move    Conv    SPP     FC      BBR     Σ - SelS
SPP MS       23.3    67.5    186.6   211.1   91.0    39.8    619.2 ± 118.0
OURS MS      23.7    17.7    179.4   38.9    87.9    9.8     357.4 ± 34.3
SPP SS       9.0     47.7    31.1    207.1   90.4    39.9    425.1 ± 117.0
OURS SS      9.0     3.0     30.3    19.4    88.0    9.8     159.5 ± 31.5

Table 3: Timing (in ms) of the original SPP-CNN and our streamlined
full-GPU implementation, broken down into selective search (SelS) and
preprocessing: image loading and scaling (Prep), CPU/GPU data transfer
(Move), convolution layers (Conv), spatial pyramid pooling (SPP), fully
connected layers (FC), and bounding box regression (BBR).

We can draw several interesting conclusions. First, for the same low
number of candidate boxes, selective search is much better than any
fixed proposal set; less expected is that performance does not increase
even with 3× more candidates, indicating that the CNN is unable to tell
which bounding boxes wrap objects better even when tight boxes are
contained in the shortlist of proposals. This can be explained by the
high degree of geometric invariance in the CNN. At the same time, the
CNN-based bounding box regressor can make loose bounding boxes
significantly tighter, which requires geometric information to be
preserved by the CNN. This apparent contradiction can be explained by
noting that bounding box classification is built on top of the FC layers
of the CNN, whereas bounding box regression is built on the
convolutional ones. Evidently, geometric information is removed in the
FC layers, but is still contained in the convolutional layers (see also
Figure 1).

Detection speed. The last experiment (Table 3) evaluates the detection
speed of SPP-CNN (which is already orders of magnitude faster than
R-CNN) and our streamlined implementation. Not counting SS proposal
generation, the streamlined implementation is between 1.7× (multi-scale)
and 2.6× (single-scale) faster than the original SPP, with the most
significant gain emerging from the integrated SPP and bounding box
regression implementation on GPU and the consequent reduction of data
transfer cost between CPU and GPU.

As suggested before, however, the bottleneck is selective search.
Compared to the slowest MS SPP-CNN implementation, using all the
simplifications of Section 3, including removing selective search,
results in an overall detection speedup of more than 16×, from about
2.5s per image down to 160ms (this at a reduction of about 6 mAP
points).
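The overall speedup can be recomputed from the per-stage timings of Table 3. The short calculation below is only an arithmetic check: it adds the selective search time to the multi-scale SPP-CNN stages and compares the total with the single-scale streamlined implementation. The variable names are illustrative.

```python
# Per-stage timings in ms, copied from Table 3:
# Prep, Move, Conv, SPP, FC, BBR.
selective_search_ms = 1.98e3
spp_multi_scale_stages = [23.3, 67.5, 186.6, 211.1, 91.0, 39.8]
ours_single_scale_stages = [9.0, 3.0, 30.3, 19.4, 88.0, 9.8]

# Baseline: multi-scale SPP-CNN including selective search.
baseline_ms = selective_search_ms + sum(spp_multi_scale_stages)
# Streamlined detector: single scale, no selective search.
ours_ms = sum(ours_single_scale_stages)

speedup = baseline_ms / ours_ms
print(f"{baseline_ms:.1f} ms -> {ours_ms:.1f} ms, speedup {speedup:.1f}x")
```

The totals are about 2.6 s and 160 ms per image respectively, i.e. a speedup slightly above 16×, in agreement with the figures quoted above.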


5 Conclusions 

Our most significant finding is that current CNNs do contain sufficient
geometric information for accurate object detection, although in the
convolutional rather than fully connected layers. This finding opens the
possibility of building state-of-the-art object detectors that rely
exclusively on CNNs, removing region proposal generation schemes such as
selective search, and resulting in integrated, simpler, and faster
detectors.

Our current implementation of a proposal-free detector is already much
faster than SPP-CNN, and very close, but not quite as good, in terms of
mAP. However, we have only begun exploring the design possibilities and
we believe that it is a matter of time before the gap closes entirely.
In particular, our current scheme is likely to miss small objects in the
image. These may be retained by alternative methods to search the object
pose space, such as for example Hough voting on top of convolutional
features, which would maintain the computational advantage and elegance
of integrated and streamlined CNN detectors while allowing a thorough
search of the image for object occurrences.